Preface



Standards-based reform and test-based accountability 
have come to be the principal approaches to educa- 
tion reform in the United States, evolving and gather- 
ing momentum over the last two decades. As these 
approaches become ever more important to raising 
achievement, and as accountability systems become 
the basis for substantial sanctions and rewards to 
schools, teachers, and students, it becomes critical 
that we use the measures that will get it right. 

The purpose of this report is to help in the evolu- 
tion of these systems by examining the measures used, 
including, but not limited to, tests. The author asks: 
Are these the best measures? Are they used right? Are 
there other measures that should be employed? It is 
the model of reform itself that is examined, and the 
report does not address specific laws and policies, 
whether they be at the district, state, or federal level.
It is hoped, however, that the report will be useful to
all who frame such laws and policies. 

For those who are most interested in knowing what 
these recommendations are without delving into the 
supporting research, the Executive Summary reviews 
the report’s key recommendations with the expecta- 
tion that the reader will turn to the body of the report 
if more detail is needed. For those who wish a little 
more detail, the Report in Brief offers a distillation of 
the research and recommendations contained in the 
full report. For those who want to obtain the most 
complete knowledge, the body of the report discusses 
the supporting research at length and provides numer- 
ous references for further exploration. The author has 
offered, then, a full measure of accommodation to 
readers’ interests and needs. 

Michael T. Nettles 
Vice President 
Policy Evaluation and 
Research Center 
Executive Summary



At the different governmental levels where standards- 
based reform and test-based accountability are used 
as approaches to education reform, there is unfinished 
business. More and better measures are needed to 
make these approaches more effective and credible, 
and we need to be more measured in the criteria used 
for judging results.

■ If we think of accountability as a structure, then 
the foundation of that structure consists of four 
walls: the content standards, performance stan- 
dards defining expected levels of attainment, the 
curriculum (and the teaching of that curriculum), 
and the test. In many places where tests are used 
for accountability purposes, however, the align- 
ment of these four walls is very often deficient in 
one aspect or another. When this is the case, the 
foundation is too weak to support the desired prog- 
ress in achievement, the assignment of failure, the 
granting of rewards, or the application of sanctions 
that ensue from the accountability process. To rem- 
edy the situation: 

• States and districts pursuing test-based account- 
ability must take advantage of the knowledge 
currently available to fix the structure — that is, to 
determine whether proper alignment exists and 
to improve where needed. 

• The tendency to cut to the chase and use test 
scores whether or not the required alignment is 
accomplished needs to be overcome. 

■ Typically, a single point on a scale — such as “ad- 
equate” or “proficient” — is used to interpret test 
results in accountability systems. But this approach 
is too limited to represent student achievement and 
progress adequately. It ties accountability only to 
the movement of a relatively few students around a 
point on a scale. Furthermore, the processes used 
to locate that single cutpoint do not produce a 
transparent alignment to the content standards. In 
addition to the percent “proficient,” other infor- 
mation drawn from the test needs to be used to 
reflect what is happening all along the achievement 
distribution in a classroom, a school, a district, or a 
state. The results should make it possible to answer 
questions such as: 



• Have the average scores of students overall, and 
of those in various subgroups, improved over 
time? 

• What gains, if any, have been made by low-per- 
forming students (e.g., those in the bottom fourth 
of the performance distribution)? By high-per- 
forming students? 

• Is there evidence that the achievement gaps 
between students in various subgroups (e.g., by 
race/ethnicity or income) have narrowed? 

■ The level of achievement that students attain is the 
result of many factors, including not only what 
happens in school, but also what has happened
in early childhood, home life, and after school. To
hold schools and teachers accountable, we need to 
measure the results of what they do while students 
are in school. To do so means measuring the growth 
or gain in learning during the school year, and de- 
termining how that changes over time. 

• This “value-added” approach has been applied in 
some places, as well as in large research studies. 

• There are technical problems to overcome, 
however. 

• Schools showing success or failure as measured 
by the level of student achievement are very often 
not the same schools as those whose success or 
failure is measured by growth or gain in achieve- 
ment during the school year. Measures of both 
level and gain are needed. Standards can be set 
for acceptable growth, as they are now for level. 

■ While the phrase “teaching to the test” is often used 
to refer to the problem of the curriculum being nar- 
rowed to what is tested, the phrase has taken on so 
many different interpretations and meanings that it 
is no longer useful. 

• Lack of agreement exists on what constitutes 
good and bad practice in preparing students for 
tests. Different standards are applied in differ- 
ent places, and there has been a lack of clarity in 
conveying to principals and teachers the desired 
and undesired practices. 
• Legitimate concern exists about whether testing
in only a few subjects, with high stakes
attached to the results, impacts the curriculum.
The concern is whether some subjects, or topics 
within a subject, get slighted. The distribution of 
instructional time needs to be measured regularly 
to gauge both intended and unintended changes. 

■ Accountability assessment is front and center in the
education reform movement. While there has been 
a lot of rhetoric about how such assessment can 
help teachers identify students’ individual needs, 
the tests are typically given at the end of the year, 
too late to inform instruction during the year. 

• What is needed is an expansion of diagnostic 
assessment use for helping teachers understand 
and address student needs. 

• Research has demonstrated that such use of 
assessments can significantly raise student 
achievement. 



■ The reform movement needs to broaden its atten- 
tion from mostly focusing on quality to encompass- 
ing concerns about quantity. The actual rates of 
high school completion (ending with the award 
of a regular diploma) are lower than official rates 
disclose, and they declined for most states in the 
1990s, according to the estimates of independent 
analysts. Measures must be improved at the level of 
the school, the district, the state, and the nation. 

The bottom line is this: Dealing with unfinished busi- 
ness is essential in order to: 

• Be more effective in reaching the important goals 
that have been set 

• Maintain credibility in the use of assessments as 
a lynchpin in the reform movement 

• Avoid unintended consequences, and 

• Measure whether intended consequences have 
been achieved. 



The Report in Brief 



Education reform has proceeded apace over the last 
decade and a half, first taking the shape of what came 
to be called standards-based reform, and then merging 
with test-based accountability. The general model has 
similarities from place to place, but with considerable 
differences among the states. A recent evaluation of 
30 states by the Fordham Foundation, based on six 
criteria, judged that the systems were only “fair” on 
average. 

The assumption of this report is that the model 
is — and should be — still evolving based on experience, 
evaluation, and research. The purpose here is to add to 
the knowledge and information available for this evo- 
lution. It is not to critique or to advocate any particu- 
lar local, state, or federal formulation. 

Many factors are involved in improving schools 
and teaching, and sometimes there are competing 
strategies for doing so. This review is by no means all 
inclusive, although it does suggest some enlargement 
of the scope of what has become the standards and 
testing approach. The goal is to improve the measures 
of success in judging students, schools, and teachers, 
in determining whether intended consequences are 
being attained and unintended consequences are being 
avoided, and in providing more information that will 
help teachers improve instruction. 

Alignment: A Necessary Condition 

The tests used for measuring progress in a standards- 
based reform system must be closely aligned with the 
content standards that specify what students should 
know and be able to do. This is the cornerstone of 
such systems. If alignment is out of whack, there can 
be no confidence that changes in the scores are a 
valid measure for accountability. The same is true as 
regards alignment of the curriculum to the content 
standards, and of the tests to the actual instruction. 

Of course, the need for alignment between actual 
instruction and the content standards goes beyond 
interpreting the test scores. The purpose of the content 
standards is to shape instruction and curriculum. If 
that is not happening, standards-based reform is not 
working, and students can hardly be expected to do 
well on tests where instruction is not aligned with the 
content standards. 



Alignment of Tests to Content Standards. In
a project coordinated by the Council of Chief State
School Officers and led by Norman Webb, research- 
ers have developed a systematic approach to facilitate 
alignment and check to see if it exists. Four criteria for 
alignment are involved. 

• Categorical Concurrence — the extent to which both 
standards and the test incorporate the same con- 
tent. 

• Depth of Knowledge Consistency — the extent to 
which what is elicited from the students on the 
assessment is as demanding cognitively as what 
students are expected to know and do as stated in 
the standards. 

• Range of Knowledge Correspondence — the extent 
to which a comparable span of knowledge expected 
of students by a standard is the same as, or corre- 
sponds to, what students need to correctly answer 
the test questions. 

• Balance of Representation — the degree to which 
one objective is given more emphasis than another. 

A somewhat similar set of criteria developed by 
Achieve, Inc. and Lauren Resnick has been used by 
that organization in a number of states to help them 
with alignment. In a five-state study using these criteria,
the authors concluded that although the states
tended to limit their tests to material that is in the 
standards, and that — with some corrections — the test 
items were generally aligned to the objectives, 

. . . the good news ends here. With few excep- 
tions, the collection of items that make up the 
tests we examined do not do a good job of as- 
sessing the full range of standards and objectives 
that the states have laid out for their students. 
What is included and excluded is systematic: the 
most challenging objectives are the ones that are 
under-sampled or omitted entirely. Thus, many 
of the tests in use by a state cannot be judged to 
be aligned to the states’ standards — even though 
most of the items map to some standard or 
objective. 

In a comprehensive 2001 study of the implementation
of standards, the American Federation of Teachers
(AFT), a leading proponent of standards and tests
with consequences, concluded that 44 percent of
states have tests that are not aligned to the standards.
Yet alignment between the test and the content
standards is critical for the test to be valid in its use
in test-based accountability as part of a standards-based
reform system.

When the Fordham Foundation rated six aspects
of standards systems in the aforementioned study, one
of which was alignment of the test to the state content
standards, it found that the average rating for the
22 states was “fair,” a 3 on a 5-point scale. Three states
scored the high of 3.8, and two states were very low:
1.5 in Hawaii and 1.8 in New Mexico. In six of the
states studied, not enough information was found to
make a judgment. The lack of such information does
not bode well for the possibility of alignment.

Alignment of Instruction to Content Standards.
Such alignment is critical, and if the curriculum
actually used in the classroom is faithful to the con- 
tent standards, the test can help tell if the content 
standards are being mastered. States are at different 
stages of adjusting the curriculum and instructional 
materials to the content standards. The Survey of the 
Enacted Curriculum Project, carried out in 11 states,
provides approaches to analyzing the degree of fit 
between what is actually being taught and the content 
standards. Among the questions asked was: How does
math and science content taught in classes compare 
to the goals outlined in state and national standards? 
The answers: 

• In middle grade math and science, most recom- 
mended standards are covered, but the level of ex- 
pectation and depth of coverage vary widely among 
schools and classes. 

• Data reveal differences in extent of teaching sci- 
ence content across the standards and the extent of 
articulation between the grades. 

Another conclusion of the AFT 50-state study was 
that fewer than one-third of the tests in use are sup- 
ported by adequate curricula. 

Alignment of Actual Instruction with the Test.
The 11-state study referenced above also investigated
whether state assessments reflect what is being taught 
in classes. The study found that state assessment items
cover a narrower range of expectations for students
than reported instruction, with tests focusing
more on memorizing facts and performing procedures 
than on solving novel problems and applying skills and 
concepts. Teaching is broader than what is on the test, 
and should be. It is the test that needs to be changed. 

In two states, it was possible to map the curriculum 
actually taught with the state test, enabling research- 
ers to draw the following conclusions. For mathemat- 
ics, “less than half of the intersections of content 
topics . . . reported by teachers were in common with 
the assessment items found on the state mathematics 
test.” The same was found for science. The authors 
of the 11-state study say the results “can provide a
database for monitoring the degree to which classroom
curriculum is moving toward the standards.”

The Passing Score: Performance Standards and 
Tracking Progress 

The rubber hits the road in test-based accountability 
when the “passing score” is established. This is typi- 
cally a point on a scale where a test score reaches or 
exceeds some level labeled “proficient”; this becomes
the performance standard in the system of account- 
ability. Such performance standards are supposed to 
be “aligned” with content standards, with performance 
standards somehow derived from the content stan- 
dards, but an eminent scholar in the field contends 
that no approach has been developed for doing this 
directly. Several methods for setting these standards
have been in use for a long time. They are
reasonably good when the purpose is to set a score for
a particular occupation, where experts from that
occupation make the judgments and where there is some
consistency in the judgments they make. But those
conditions do not exist in setting cutpoints, for example,
for eighth grade mathematics:

• The cutpoints for setting performance levels on the
tests of the National Assessment of Educational
Progress (NAEP) have been labeled as “fatally
flawed” by the National Academy of Education and
the National Academy of Sciences. Therefore, by
direction of Congress, each NAEP report contains 
a warning label indicating that the performance 
levels are “developmental.” 
• Different methods available for setting performance
standards can produce quite varying results.
Kentucky used three different methods with the 
result that either 56.6 percent, 29.4 percent, or 15.3 
percent were at or above the proficient level. Ken- 
tucky took these results into account in deciding on 
a level to be required by the state. 

No matter which one of the available methods is 
used, there is nothing in these procedures that pro- 
vides a direct link to the content standards in terms of 
knowing what the score means with respect to how 
much of the required knowledge has been mastered. 
Beyond that, using a single cutpoint on a scale is in- 
adequate as a sole representation of performance on a 
test. 

• Using a single cutpoint for accountability means 
that decisions are based only on the movement up 
and down of a relatively few students below and 
above the cutpoint. Progress of the rest of the stu- 
dents is ignored. 

• Improvement or deterioration in the achievement 
of students above the cutpoint, as well as those 
below the cutpoint, is not revealed by the account- 
ability systems typically in use. The distribution 
of performance in the United States is very wide — 
wider than in any other developed country — and 
the distribution of scores within a school may be 
very wide, as well. We should therefore be con- 
cerned about whether the bottom quarter, or the 
top quarter of students, for example, are losing or 
gaining ground. 

• Use of a single such cutpoint can be very mislead- 
ing about the performance of an education system. 
(Of course, a single test, however used, should not 
be the sole basis of making high-stakes decisions.) 

Take the case of NAEP assessment results in 
mathematics for Mississippi from 1992 to 1996. No 
improvement was seen over that period in the percent 
of students reaching the level of “proficient” as defined
by NAEP. However, 

• The average score improved; 

• The average score for the bottom quartile im- 
proved; 

• The average score for the top quartile improved; 
and 



• The gap between the top and bottom quartiles was
reduced.

Broader measures are therefore needed to capture 
the level of achievement of students and to compare 
changes in the levels of achievement over time. There 
are numerous options. One option would be to develop 
a composite of measures, which might include but 
not be limited to the percent reaching the proficient 
level. For an example, see Table 2 on page 23 showing 
several indicators of changes in eighth grade science 
achievement: the average score, the percent “profi- 
cient,” the average for the bottom quartile, the aver- 
age for the top quartile, the gap between the top and 
bottom quartiles, the gap between White and minority 
students, and the gap between the poor and non-poor 
students. In the case of eighth grade science, while no 
state declined on the basis of the percent proficient, 
several had declines in the bottom quartile, 11 had an
increase in the gap between the top and the bottom 
quartiles, and 7 had an increase in the gap between 
poor and non-poor students. 
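The kind of composite described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores and the cutpoint are invented, the `summarize` function is a hypothetical helper rather than any state's or NAEP's actual procedure, and quartiles are approximated by taking the lowest and highest fourth of the sorted scores.

```python
from statistics import mean

def summarize(scores, cutpoint):
    """Return several indicators for one group of test scores."""
    ordered = sorted(scores)
    quarter = len(ordered) // 4           # number of students in each fourth
    bottom = mean(ordered[:quarter])      # average of the lowest-scoring fourth
    top = mean(ordered[-quarter:])        # average of the highest-scoring fourth
    at_or_above = sum(s >= cutpoint for s in ordered)
    return {
        "percent_proficient": 100 * at_or_above / len(ordered),
        "average": mean(ordered),
        "bottom_quartile_avg": bottom,
        "top_quartile_avg": top,
        "quartile_gap": top - bottom,
    }

# Hypothetical scores for two years, invented for illustration only.
year_1 = [200, 210, 220, 230, 240, 250, 260, 300]
year_2 = [220, 230, 235, 240, 245, 255, 265, 302]
CUTPOINT = 299  # hypothetical "proficient" cutpoint

before = summarize(year_1, CUTPOINT)
after = summarize(year_2, CUTPOINT)

for name in before:
    print(f"{name}: {before[name]} -> {after[name]}")
```

With these invented numbers the percent "proficient" does not move at all, while the average, both quartile averages, and the quartile gap all change. That is the same pattern reported above for Mississippi, and it is invisible to a system that tracks only the single cutpoint.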

Accountability for Growth Due to Schooling

Indicators of the level of student achievement over 
time, discussed above, are more relevant for measur- 
ing progress for the nation or state or community as 
a whole than for gauging school and teacher effective- 
ness. No matter which indicators are used, compar- 
ing this year’s eighth graders with past years’ eighth 
graders is a limited way to evaluate school effective- 
ness if there is any change in the demographic make- 
up of the eighth grade class, or if one class was either 
better or more poorly prepared than the other when it 
entered the eighth grade. 

A lot of factors enter into how much mathematics 
a student knows at the end of the eighth grade, and a 
lot of factors enter into what this year’s eighth graders 
know compared to eighth graders five years ago — fac- 
tors that have nothing to do with how well teachers 
taught over the prior school year. Schools judged 
adequate when measured by changes in the level of 
student achievement over time, as is typical at present, 
are often not the same schools that are judged ade- 
quate by changes in the amount of growth or gain that 
take place within the school year, sometimes called the 
value added. 
• Among the states participating in NAEP from 1992
to 1996, Maine had the highest scores in fourth and
eighth grade mathematics, and Arkansas had the
lowest scores. Yet students in Maine and Arkansas
both gained 52 scale points from the fourth to the
eighth grade. So, does Arkansas have as effective a
school system as Maine?

• Tennessee for well over a decade has had a system
based on gain scores as well as on levels of achieve- 
ment. For contrast, see the results in Bradley 
County for 2003 on grades given by the state for 
performance in six subjects. The grades for the 
level of achievement were one A, three Bs and two 
Cs. For gain, the county had two Cs, two Ds and 
one F. In other words, the county had high achiev- 
ing students, but they were lagging in how much 
they were improving. The opposite was true in 
other counties. The Tennessee system, which was 
changed recently by the state, represents only one 
of several ways to measure gain. 

• A recent research study conducted in a large urban
school district in southwestern United States exam- 
ined middle school students’ test scores. When the 
researchers compared the difference between the 
mean achievement of students in the same grade 
over time, and the growth in achievement of the 
same students, they concluded that: 

evaluations of school performance differs de- 
pending on whether school mean achievement 
or school mean growth are examined . . . Evalu- 
ation of these estimates showed that the school 
mean performance was not strongly predictive 
of the school mean rate of growth . . . character- 
ization of school performance is substantially 
different depending on whether mean achieve- 
ment or mean growth is examined. 

• The Consortium on Chicago School Research, led
by Anthony Bryk, has been using a “Learning Gain 
Index” to produce a school productivity profile for 
about a decade. According to Bryk, the approach 
stems from a belief that “a school should be held 
responsible for the learning that occurs among 
students actually taught in the schools.” 

• A study of 230,000 students by the Northwest
Evaluation Association found that many schools
with high scores had low growth in achievement
during the year.
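The distinction between level and gain in the examples above can be made concrete with a small Python sketch. It is illustrative only: the two schools and their fall and spring scores are invented, and the simple average of spring minus fall is just one of several possible ways to measure gain, not the Tennessee or Chicago method.

```python
from statistics import mean

# Hypothetical (fall, spring) scores for the same students in each school,
# invented for illustration only.
schools = {
    "School A": [(250, 255), (260, 264), (270, 273)],
    "School B": [(200, 215), (210, 226), (220, 234)],
}

def level(pairs):
    """Average end-of-year score: the 'level' measure."""
    return mean(spring for _, spring in pairs)

def gain(pairs):
    """Average within-year growth of the same students: the 'gain' measure."""
    return mean(spring - fall for fall, spring in pairs)

for name, pairs in schools.items():
    print(f"{name}: level {level(pairs)}, gain {gain(pairs)}")
```

In this invented case School A ranks higher on level while School B ranks higher on gain, so a judgment about which school is "adequate" depends on which measure is used.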

Given these problems, many researchers have advo- 
cated the use of gain or “value-added” measures to 
evaluate the effectiveness of schools and teachers. 

• Goldstein, describing school effectiveness studies 
in Britain, indicates that “it is now recognized that 
intake achievement is the single most important 
factor affecting subsequent achievement, and that 
the only fair way to compare schools is on the basis 
of how much progress pupils made during their 
time in school.” 

• Walberg points to an increasing recognition “that 
value-added scores better indicate a school’s or 
teacher’s contribution to achievement than do test 
scores at a single point in time . . . non-value added 
scores, however, can complement value added 
scores, and together, they give policy makers more 
information and are less misleading than either one 
alone.”

• Lowery and Kubzdela write: “Currently, the most 
accurate, accepted and utilized method for measuring
teacher quality is value added or gain analysis
. . . value-added analysis follows the progress of
individual students by tracking changes in their test 
scores from year to year.” 

The measure of growth and gain is necessary for 
holding schools and teachers accountable and deter- 
mining their effectiveness. However, there are choices 
to be made as regards the best way to do this and 
methodological problems to be dealt with. The mea- 
sure of level of achievement and its change over time 
tells us how well the state, community, or nation as 
a whole is doing with regard to progress in educa- 
tion achievement. Both measures are needed. And 
as Stephen Raudenbush cautions us in a new report 
from the ETS Policy Information Center, a test alone, 
whichever approach is used, needs to be supplemented 
by other information if high-stakes decisions are to be 
made. 

Teaching and the Test 

Two aspects are examined. The first is how instruction 
is changed to prepare students in tested subjects. The 
second is what happens to instruction in non-tested 
subjects. 



Teaching in Tested Subjects. With 11,000 entries
in Google, the term “teaching to the test” has been 
used to denote so many different situations that it 
has become virtually useless in conveying any com- 
mon meaning. There are shades of gray in how much 
instruction is specifically tailored to what appears or is 
expected to appear on a test, and differing judgments 
about what is desirable, what is educationally sound, 
and what is legitimate versus what is cheating. It is not 
a simple matter for the teacher to know what to do, or 
for a principal to know what to encourage teachers to 
do. When a test score has real consequences attached 
to it, distortions in or departures from “regular” teach- 
ing can be expected. 

• Efforts to rank practices from “good” to “bad” 
disclose disagreements. Readers have to make their 
own judgments, outright cheating aside. Although 
the use of practice tests was ranked as bad by some 
in a national survey, over half of teachers used such 
tests from their state a great deal (24 percent) or 
somewhat (28 percent) to help students prepare for 
the state test. While instructing students in test-tak- 
ing skills was ranked as suspect by others surveyed, 
most teachers used this test-preparation approach 
a great deal (45 percent) or somewhat (46 percent). 

• Clarity by education officials up and down the 
hierarchy is critical, so that teachers operate in a 
situation of known standards and expectations. 

• Where there is poor alignment between the test 
and curriculum actually in use, a not infrequent 
case, and where there is limited alignment between 
the test and the state content standards, then even 
when the curriculum is aligned to the standards, 
teachers are on the horns of a dilemma. How do 
they deliver on test scores without finding ways to 
prepare students for material not covered? What is 
and is not appropriate under such circumstances — 
circumstances not created by the teacher? 

• An understanding is needed of when a score result 
can be relied upon to represent a degree of achieve- 
ment of the standards and when it cannot. When 
does a blood pressure reading represent real blood 
pressure and not a distortion due to the way it is 
taken? More measured approaches can be used
to check whether the test results from these large-scale
test operations really represent the full domain
established by content standards.

Teaching in Subjects Not Tested. Concern is 
frequently expressed about whether subjects not cov- 
ered by test-based accountability are being neglected. 
Whether the objective is to continue the emphasis on 
the other subjects or to reduce instructional time in 
them, measures are needed that permit judging what 
is happening. And educational authorities need to be 
clear about what they do and do not expect. 

• In Anne Arundel County, Maryland, after the school 
administration reduced some middle school offer- 
ings, the Coalition for Balanced Excellence in Edu- 
cation succeeded in getting the decision overturned 
at the state level. 

• In North Carolina, the past president of the state’s 
Council for Social Studies said that, with testing in 
just a few subjects, “. . . social studies is left behind 
because there is not testing.” 

• In 2003, The Wall Street Journal ran an article en- 
titled “Schools Say ‘Adieu’ to Foreign Languages.” 

• A Boston College survey found that in states with 
high-stakes testing, one-fourth of the teachers re- 
ported cutting back in untested subjects. 

• A four-state study by the Council on Basic Educa- 
tion found increases in instructional time devoted 
to subjects tested, and decreases in many subjects 
not tested. 

A more measured approach involves tracking the
distribution of instructional time, so that education
policy makers can assure themselves that there are no
unintended consequences for subjects not covered by
test-based accountability, and that any intended
redistributions in tested subjects are taking place.

Assessment to Inform Instruction 

In the current standards-based reform model, testing 
is used for accountability, with tests given at the end of 
the school year to evaluate teachers and schools. From 
its earliest beginnings, however, testing has had the 
promise of being used during the school year to inform 
instruction and help teachers identify and address 
students’ individual instructional needs. Such test- 
ing — “formative” and “diagnostic” — needs to be given a 
central role in education reforms. 
A synthesis of research on the achievement effects
of such assessment to inform instruction reveals a 
substantial impact, with “effect” sizes ranging from 0.4 
to 0.7. 

• An effect size of 0.4 would mean that the average 
pupil involved in an innovation would record the 
same achievement as a pupil in the top 35 percent 
of those not involved. 

• An effect size gain of 0.7 in the recent international 
comparative studies in mathematics would have 
raised the score of a nation in the middle of the 
pack of 41 countries (e.g., the United States) to one 
of the top five. 
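The first of these interpretations can be checked with a short calculation. The sketch below assumes, as is conventional but not stated in the research synthesis, that an effect size is a difference in means expressed in standard deviation units and that scores in the comparison group are normally distributed. The function name `share_above` is a hypothetical helper.

```python
from statistics import NormalDist

def share_above(effect_size):
    """Share of the comparison group scoring above the average treated pupil,
    assuming normally distributed scores with equal spread in both groups."""
    return 1 - NormalDist().cdf(effect_size)

for d in (0.4, 0.7):
    print(f"effect size {d}: top {100 * share_above(d):.1f} percent")
```

Under these assumptions, an effect size of 0.4 places the average participating pupil at roughly the boundary of the top 35 percent of non-participants, which agrees with the figure given above. The international ranking claim for 0.7 depends on the actual spread of country scores and cannot be checked this way.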

The standards-based reform movement has not 
specifically encompassed the use of testing for these 
formative and diagnostic purposes. With increases oc- 
curring in the use of tests for accountability, such use 
could even be diminishing. There are many examples, 
however, of this testing. 

• The Council of Chief State School Officers found 
that one distinguishing characteristic of the five 
high-performing schools they studied is that the 
staff at each school use standardized assessment 
data “to identify areas where students can improve 
and where their own teaching strategies can be 
adjusted to meet students’ needs.” 

• A California study identified 16 schools with high 
performance of minority students and 16 with low
performance, all schools with similar socioeco- 
nomic characteristics. One key characteristic of the 
successful schools, as compared with the others, 
was the frequency of testing to guide individual stu- 
dent instruction. 

Measuring School Completion 

Standards-based reform and test-based accountability 
are focused on raising achievement levels of students. 
In that context, good statistics are needed on how 
many students leave school without getting a regular 
high school diploma, and education reform ought to 
be about quantity as well as quality. The questions are 
always asked: Will the higher standards result in more 
students leaving school? Will higher standards keep 
more students in? 



We need to know more about the terms of trade 
between higher standards and school completion, so 
that informed decisions can be made. Of course, even 
when the terms are known, judgments will differ on 
how to strike the balance. While much is written, little 
is known for certain about whether there has so far 
been any widespread impact on school completion 
with a diploma. Schools and students are, as might be 
expected, under considerable pressure. 

• Independent estimates at the national level indicate 
that high school completion rates have fallen over 
the last decade. What the decline is related to has 
not yet been established, but by itself it is a very 
serious matter. 

• There are instances of students being dismissed 
from school because of their poor prospects for 
meeting standards — or, more likely, transferred to 
a GED preparation class where their achievement 
scores are not counted in the accountability sys- 
tem. 

• New York City recently settled a lawsuit in which 
one of its schools was charged with discharging 
poorly performing students. These students are being 
readmitted. Two other suits are still pending. 

Official government measures, whether state or 
federal, have come in for considerable criticism from 
researchers over the past several years. These rates 
typically have not shown a decline, for a number 
of reasons. This report explains and compares high 
school completion estimates, state by state, using 
five different methods, including one by the National 
Center for Education Statistics and one by this author. 
It also shows the completion rates submitted by the 
states to the Department of Education in September 
2003, as required by the No Child Left Behind Act. A 
few examples will illustrate the comparisons. 

• For Georgia, the NCES calculates a high school 
completion rate of 71 percent, compared with 54, 
54, 57, and 58 percent under the four independent 
methods; under NCLB, Georgia reported 62 per- 
cent. 

• For New York, the NCES reported 82 percent, com- 
pared with 70, 60, 67, and 65 percent; under NCLB, 
New York reported 75 percent. 

• Much less variation occurred in some states. For 
example, in Idaho, the NCES rate was 77 percent, 
compared with 78, 75, 71, and 73 percent; under 
NCLB, Idaho reported 77 percent. 
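
The size of the disagreement can be summarized with simple arithmetic. The sketch below uses only the three state examples quoted above; the dictionary layout and function name are hypothetical, and averaging the four independent estimates is an illustrative choice rather than a method used in this report.

```python
from statistics import mean

# Completion rates (percent) as quoted in the examples above.
rates = {
    "Georgia": {"nces": 71, "independent": [54, 54, 57, 58], "nclb": 62},
    "New York": {"nces": 82, "independent": [70, 60, 67, 65], "nclb": 75},
    "Idaho": {"nces": 77, "independent": [78, 75, 71, 73], "nclb": 77},
}


def gap_from_independent(state):
    """NCES rate minus the average of the four independent estimates."""
    r = rates[state]
    return r["nces"] - mean(r["independent"])


for state in rates:
    print(f"{state}: NCES exceeds the independent average by "
          f"{gap_from_independent(state):.2f} percentage points")
```

On these figures, the NCES rate exceeds the independent average by roughly 15 points in Georgia and 16 points in New York, but by less than 3 points in Idaho.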

As for state-by-state trends, the NCES estimates 
have been the only ones available, and they are not 
available for all states. Few states were in this system 
a decade ago, so trend comparisons over a decade 
are not possible. This author made estimates for the 
period from 1990 to 2000. While five states raised their 
high school completion rates over that decade, the 
general pattern was one of declining rates: 

• 16 states declined up to 3.9 percentage points; 

• 18 states declined from 4 to 7.9 percentage 
points; 

• 9 states declined from 8 to 11.9 percentage 
points; and 

• 1 state declined 13 percentage points. 

Challenges and possibilities are offered in the 
report. The U.S. Department of Education is now 
working to improve the estimates, as should each 
state. Education reform should be about both the 
quality and the quantity of educational achievement. More 
measured approaches need to be applied to the matter 
of quantity. 



In order for the standards-based reform model 
to remain credible and continue to evolve, it will be 
necessary to attend to unfinished business. Content 
standards must be fully translated into curriculum 
and instruction, all components of the system must 
be aligned, and accountability assessment approaches 
must be used that better measure what is learned 
based on what is taught in the classroom and that 
measure gain in achievement as well as level. Clear 
understandings need to be conveyed to schools and 
teachers about what is correct and educationally 
sound in getting students ready for tests. Standardized 
testing for informing instruction during the school 
year needs to become an integral part of the model, as 
does measuring the distribution of instructional time 
among subjects and tracking changes — particularly 
among tested and untested subjects. The quantity as 
well as the quality of education needs to be attended 
to, as well, with better measures of high school com- 
pletion and a better understanding of why completion 
rates are so low — and falling. 

The bottom line is that we badly need better and 
more comprehensive measures if we are to have an ac- 
curate picture of how the system is functioning, where 
more effort is needed, or where policy adjustments are 
required. What is at stake are effectiveness, credibility, 
and the avoidance of unintended consequences. 
Introduction 



Over the last 15 years or so, the standards-based re- 
form model, increasingly accompanied by test-based 
accountability, has emerged to become the principal 
approach to improving public education in the United 
States. According to this model, education systems 
must determine what students should know and be 
able to do (content standards), decide how much 
competence they should demonstrate (performance 
standards), align curriculum and instruction with the 
content standards, and conduct testing to measure if 
students are learning the desired content. 

The rigor and style with which standards-based 
reform is being implemented in different states and 
localities have varied considerably. In some places, 
the model is being carefully applied. In other places, 
it is being done poorly, and unintended consequences 
have resulted. The No Child Left Behind (NCLB) Act, 
passed in 2001, spurred the model’s dissemination and 
will probably result in its being applied more uniform- 
ly across the states, particularly in the use of testing 
for accountability purposes. 

This author, in previous publications for the ETS 
Policy Information Center, has traced the emergence 
of the standards-based reform model, as well as the 
role of assessment in U.S. education. 1 It was the un- 
derlying assumption of these reports that the model, 
which originated in the 1980s with the standards 
established by the National Council of Teachers of 
Mathematics, was a good one that could be applied at 
the school, district, or statewide level. These reports 
also had a considerable amount to say about the many 
variants of the emerging model, and the constructive 
use of standardized testing within it. 

This new report similarly accepts the current 
reform model and its four main components: content 
standards, performance standards, curriculum and 
instruction, and testing. How these are structured, and 
at what level of governance, leaves a lot of area for 
debate and disagreement, however. 



If standards-based reform is the house that is being 
built, then this analysis is intended to aid in the design 
and construction of the building. The aim is to lay a 
strong foundation so that the structure is not continu- 
ally swaying in the winds of disillusion, distrust, and 
disgruntlement. Thus, this report addresses the stan- 
dards-based reform model generally, rather than its 
formulation by one district or one state or the nation 
as a whole. This is not, for example, a critique of the 
now long-standing Kentucky model, or of the NCLB 
Act. The title, “More Measured Approaches,” refers to 
the need for measured steps based on a thorough un- 
derstanding of potential missteps and consequences. It 
also refers to ways of monitoring whether the expected 
changes are, in fact, occurring, as well as whether side 
effects that are not intended are emerging. 

Specifically, the report addresses six key areas. 

First is the alignment of the test with the state con- 
tent standards, the actual curriculum in the classrooms 
with the content standards, and the test with what is 
actually taught. Of course, complete alignment of the 
first two would assure the third. Without alignment, 
the meaning of test scores, and changes in scores over 
time, are called into question, as are accountability 
decisions made with such testing. Test scores are being 
used, with serious consequences, where alignment 
does not fully exist. This, too, will have serious conse- 
quences. 

Second are the processes used to define the level of 
student performance that is sought, and to determine 
the adequacy of performance on the accountability 
tests, and to indicate whether progress over time is 
or is not being made. There are, it is argued, serious 
weaknesses in current practice. 

Third are considerations involved in assigning 
responsibility for student achievement, and lack of it, 
through accountability testing. How is the effectiveness 
of teachers and schools being measured? How can 
measures be developed that hold teachers accountable 
for what they teach in a given year, as opposed to mea- 
sures that reflect all that students learned from prior 
teaching or life experiences? 



1 Paul E. Barton, Too Much Testing of the Wrong Kind; Too Little of the Right Kind in K-12 Education, Policy Information Perspective, Policy 
Information Center, Educational Testing Service, March 1999; Paul E. Barton, Facing the Hard Facts in Education Reform, Policy Information 
Perspective, Policy Information Center, Educational Testing Service, July 2001; and Paul E. Barton, Staying on Course in Education 
Reform, Policy Information Perspective, Policy Information Center, Educational Testing Service, April 2002. 
Fourth is “teaching to the test,” and the need to 
assure that the education sought is actually occurring, 
as interpreted through test scores, based on validity 
studies. Also, there is the question of the desired and 
undesired effect of accountability testing on the scope 
of what is taught, with particular attention to differ- 
ences between tested and untested subjects. 

Fifth is an element that is not specifically addressed 
by the model: the use of testing for more than account- 
ability. This concerns the use of tests in the classroom 
during the school year to aid instruction. The terms 
frequently used are “formative” and “diagnostic” as- 
sessment, as contrasted with “summative” assessment 
used for evaluation of schools and teachers in an ac- 
countability system. 



Sixth is the systematic assessment of school comple- 
tion. Because current measures are inadequate, 
too much remains unknown about the relationship 
between higher standards and graduation rates. Some 
believe that higher standards will increase the num- 
ber of students dropping out (or being “pushed” out), 
while others dispute this. Some believe that higher 
quality is desirable even if fewer students may achieve 
at the higher level. 

The following sections of this report address these 
issues in turn, with the goal of assisting those who are 
constructing laws and policies related to the stan- 
dards-based approach. The underlying assumption 
is that the model is — and should be — continuing to 
evolve, based on experience and expanding knowledge. 
Alignment: A Necessary Condition 



The tests used for measuring progress in a standards- 
based reform system must be closely aligned with the 
state or district content standards that specify what 
students should know and be able to do. This is the 
cornerstone of such systems. If the alignment is out of 
whack, there can be no confidence that changes in test 
scores are a valid measure of accountability. 

Even if the test is closely tied to the content stan- 
dards, however, performance on the test is meaning- 
less if the curriculum and instruction do not give 
students the opportunity to achieve those standards. 
Students cannot be expected to do well on a test that 
measures content they have not been taught. It is 
therefore also essential for curriculum and instruction 
to be aligned with the content standards and the test. 

While the integrity of the accountability system 
depends on proper alignment, and there are tools and 
approaches for achieving it, there are undoubtedly 
challenges to be met. Given that content standards 
tend to be quite broad and varied, for example, how 
can a single test encompass it all? We want teaching to 
be broad, not narrowed because of problems in trans- 
lating between standards and tests. 2 

This section addresses methods developed to create 
the necessary alignment and measure the degree of it. 
It also discusses the still limited (but growing) body 
of information on the extent to which alignment has 
been achieved. 

Of course, the need for alignment goes far beyond 
giving meaning to the test scores. The purpose of 
the content standards is to shape instruction and the 
curriculum. If that is not happening, standards-based 
reform is not working. 

Alignment of Tests to Content Standards. As not- 
ed earlier, alignment between the test and the content 
standards is critical for the test to be valid in its use as 
part of a standards-based accountability system. The 
purpose of the test is to measure the degree of achieve- 
ment of the standards. The point is made forcefully 



in the American Educational Research Association’s 
(AERA) Research Points: 

Today’s calls for alignment are built upon a 
foundation of more than 70 years of research on 
the development, evaluation, and use of tests. 
Standards for Educational and Psychological 
Testing, the recognized authority on educational 
testing, stresses that a “valid” test must show 
that it actually measures the constructs — knowl- 
edge, skills, abilities, processes, and characteris- 
tics — it was intended to measure. When a test is 
used to measure the achievement of curriculum 
standards, it is essential to evaluate and docu- 
ment both the relevance of a test to the stan- 
dards and the extent to which it represents those 
standards. 3 

Similarly, researcher Robert Rothman has empha- 
sized that the validity of the test results depends on 
proper alignment: 

If a test measures only some of the expecta- 
tions the standards hold for all students, can 
a score on a test truly represent a measure of 
performance against the standards? ... If a test 
measures only some of the knowledge and skills 
expected for all students, what does a passing 
score indicate? Does it mean that students who 
attain the score have demonstrated proficiency 
on the test or on the standards? 4 

The validity problem is especially serious when 
standardized norm-referenced tests are relied upon for 
accountability purposes, with stakes attached to the 
results. The U.S. Department of Education’s Handbook 
for the Development of Performance Standards specifi- 
cally warns against this practice, since “it is unlikely 
that any ‘off-the-shelf’ test will fully align with the 
breadth and depth of a state’s or local system’s content 
standards.” 5 Furthermore, the results of norm-refer- 
enced tests compare students with other students and 
schools with other schools, rather than indicate what 



2 One way to ensure broad coverage of subject matter on a test is to use “matrix sampling.” In the National Assessment of Educational 
Progress (NAEP), for example, different samples of students take different sets of items; the compiled results accomplish what a single 
test of several hours would reveal. 

3 Research Points, Spring 2003, Volume 1, Issue 1, p. 4. 

4 Robert Rothman, “Benchmarking and Alignment of State Standards,” Redesigning Accountability Systems for Education, edited by Susan 
H. Fuhrman and Richard F. Elmore, Teachers College, 2004, p. 112. 

5 Linda N. Hansche, with contributions by Ronald K. Hambleton, Craig N. Mills, Richard Jaeger, and Doris Redfield, Handbook for the De- 
velopment of Performance Standards: Meeting the Requirements of Title I, prepared for the U.S. Department of Education and the Council 
of Chief State School Officers, 1998, pp. 21-22. 
students know relative to what they are supposed to 
know. 

Accepting that there must be alignment between 
the test and the content standards, the question then 
becomes, what is the current status? Are problems of 
misalignment widespread, or are they rare? According 
to the AERA review of alignment studies cited earlier: 

While specific findings may vary from study to 
study, all of the research points to one central 
conclusion: Alignment needs to be improved. 

For some extreme cases, studies have found that 
alignment between state standards and tests is 
so weak that the standards from one state more 
closely match the tests used in another state. 

Other research efforts have also concluded that 
alignment problems are fairly common. For example, 
a comprehensive study of the implementation of stan- 
dards in 2001 by the American Federation of Teachers 
(AFT) — a leading proponent of standards and tests 
with consequences — concluded that 44 percent of the 
states have tests that are not aligned to the standards. 6 

In another recent study, the Fordham Founda- 
tion graded the degree of alignment between the tests 
and the content standards in 22 states on a scale of 1 
through 5 (with 5 meaning outstanding performance, 
and 1 meaning poor performance). Researchers found 
that the average rating was 3, or “fair.” The lowest 
states were Hawaii (1.5) and New Mexico (1.8). The 
highest were Georgia, Pennsylvania, and Virginia, all 
with 3.8. All of the top-performing states were using 
tests that were both state-developed and 
criterion-referenced. 7 

Some researchers have developed criteria that 
are useful not only in evaluating alignment between 
tests and content standards, but also in improving 
that alignment. One important project of this nature 
has been the work of Achieve, Inc. staff and Lauren 
Resnick of the University of Pittsburgh. They began 
by examining the typical process that test developers 



use to show, usually in a matrix presentation, how the 
items or tasks on the test match the statement in the 
content standards. The authors argued that while “this 
[approach] seems pretty straightforward, ... it masks 
myriad difficulties.” 

At the request of several states wanting to improve 
their processes, the authors set out to develop a meth- 
odology. They stated their purposes as follows: 

We wanted to judge not only the quantity of 
individual items, but also the overall qualities of 
the tests — range, balance, and degree of chal- 
lenge — represented by the set of test items as a 
whole. Furthermore, we sought a method that 
recognized that alignment is not an attribute 
of either standards or assessments, per se, but 
rather of the relationship between them. And be- 
cause it describes the match between standards 
and assessments, alignment can legitimately be 
improved by altering either one of them or both. 8 

The authors established four dimensions for re- 
viewing alignment: 

• Content Centrality “provides a deeper analysis of 
the match between the content of each test ques- 
tion and the content of the related standard by 
examining the degree or quality of the match.” 

• Performance Centrality “focuses on the degree of 
the match between the type of performance (cog- 
nitive demand) presented by each test item and 
the type of performance described by the related 
standard.” 

• Challenge is a criterion that is “applied to a set of 
items to determine whether doing well on these 
items requires students to master challenging sub- 
ject matter.” 

• Balance and Range addresses whether the tests 
cover “the full range of standards with an appropri- 
ate balance and emphasis across the standards.” 



6 American Federation of Teachers, Making Standards Matter 2001. 

7 Richard W. Cross, Theodore Rebarber, Justin Torres, and Chester E. Finn, Jr. (editors), Grading the Systems, the Guide to State Standards, 
Tests, and Accountability Policies, Thomas B. Fordham Foundation, January 2004. States were graded on six components, including the 
alignment between test and content standards. Often, the basis for rating a state did not exist. To understand these ratings, readers are 
advised to look at the specific criteria used in the evaluations. 

8 Robert Rothman, Jean B. Slattery, Jennifer L. Vranek and Lauren B. Resnick, Benchmarking and Alignment of Standards and Testing, CSE 
Technical Report 566, Center for the Study of Evaluation, Graduate School of Education and Information Studies, University of Califor- 
nia, Los Angeles, May 2000. 
When Robert Rothman and his colleagues applied 
these criteria to tests and standards in five states, they 
concluded the following: 

They have, for the most part, limited their tests 
to material that is in the standards — a primary 
requirement for a fair accountability test . . . 
Further — at least after our correction of the test 
blueprint — included test items are generally 
quite well aligned to the standards or objective 
to which they are mapped. 

But the good news ends here. With few excep- 
tions, the collections of items that make up 
the tests we examined do not do a good job of 
assessing the full range of standards and objec- 
tives that states have laid out for their students. 
What is included and excluded is systematic: 
the most challenging standards and objectives 
are the ones that are under-sampled or omitted 
entirely . . . Thus, many of the tests in use by a 
state cannot be judged to be aligned to the state’s 
standards — even though most of the items map 
to some standard or objective. 

Achieve, Inc. subsequently worked with individual 
states to improve alignment between tests and content 
standards. 

The Council of Chief State School Officers has co- 
ordinated another effort to assist states in measuring 
the degree of alignment between tests and standards, 
and in developing tests that reflect what the content 
standards require. The results were reported in a study 
by Norman L. Webb. 9 

The first step was to develop the criteria for deter- 
mining alignment between the test and the content 
standards, and to train reviewers in the process of de- 
termining alignment. Four criteria were established: 

a. Categorical Concurrence. This criterion “provides 
a very general indication of whether both docu- 
ments incorporate the same content. The criterion 
of categorical concurrence between standards and 
assessment is met if the same or consistent cat- 
egories of content appear in both documents. This 
criterion was judged by determining whether the 
assessment included items measuring content from 
each standard.” 



b. Depth-of-Knowledge Consistency. “Depth-of-knowledge 
consistency between standards and 
assessments indicates alignment if what is elicited 
from students on the assessment is as demanding 
cognitively as what students are expected to know 
and do as stated in the standards.” Four depth-of- 
knowledge levels were defined for each of the four 
content areas (reading, writing, math, and social 
studies), and elaborated on in considerable detail: 

• Level 1. Recall or simple reproduction of 
information. 

• Level 2. Use of skills and concepts. 

• Level 3. Strategic thinking. 

• Level 4. Extended thinking. 

c. Range-of-Knowledge Correspondence. “The 
range-of-knowledge criterion is used to judge whether a 
comparable span of knowledge expected of stu- 
dents by a standard is the same as, or corresponds 
to, the span of knowledge that students need in 
order to correctly answer the assessment items/ac- 
tivities.” 

d. Balance of Representation. “The balance-of-representation 
criterion is used to indicate the degree to 
which one objective is given more emphasis on the 
assessment than another. An index is used to judge 
the distribution of assessment items.” 

A Source of Challenge Criterion “is used only to 
identify items on which the major cognitive demand 
is inadvertently placed and is other than the targeted 
construct (skill, concept or application).” 
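
The first two criteria lend themselves to a simple tally. The sketch below is a hypothetical illustration, not the procedure used in the Webb study: the item data, the standard names, and the `min_items` cutoff are all invented for the example. Categorical concurrence is checked as described above (does the assessment include items for each standard?), and depth-of-knowledge consistency is summarized as the share of a standard's items that are at least as demanding as the standard itself.

```python
from collections import Counter

# Hypothetical data: each item is (standard it measures, item depth-of-knowledge level).
items = [("Reading", 1), ("Reading", 2), ("Reading", 3), ("Writing", 1), ("Writing", 1)]
# Hypothetical depth-of-knowledge level expected by each standard.
standard_dok = {"Reading": 2, "Writing": 2, "Research": 3}


def categorical_concurrence(items, standards, min_items=1):
    """True for each standard measured by at least min_items items.

    min_items is an illustrative cutoff, not a value taken from the report.
    """
    counts = Counter(std for std, _ in items)
    return {s: counts[s] >= min_items for s in standards}


def dok_share(items, standard_dok):
    """Share of each standard's items at or above the standard's level.

    Returns None for a standard with no items on the assessment.
    """
    totals, meets = Counter(), Counter()
    for std, level in items:
        totals[std] += 1
        if level >= standard_dok[std]:
            meets[std] += 1
    return {s: (meets[s] / totals[s] if totals[s] else None) for s in standard_dok}


concurrence = categorical_concurrence(items, standard_dok)
shares = dok_share(items, standard_dok)
print(concurrence)
print(shares)
```

In this invented example, the "Research" standard fails categorical concurrence because no item measures it, and every "Writing" item falls below the level the standard expects.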

An example of a summary for a state in one grade 
and subject area is provided in Table 1, for language 
arts in grade 11 for state F. 

In different grades, states, and subjects, the degree 
of alignment varied considerably. The states volun- 
teering to participate in the study were undergoing 
reviews and changes in their programs, and the study 
gave them information they could use. The degree of 
alignment in these states has likely changed since the 
study. 



9 Norman L. Webb, Alignment Study in Language Arts, Mathematics, Science, and Social Studies of State Standards and Assessments for Four 
States, Council of Chief State School Officers, December 2002. 
Table 1: Is the Alignment Acceptable? 
Language Arts in Grade 11, State F 

Standards                         Categorical    Depth-of-Knowledge     Range-of-Knowledge     Balance of 
                                  Concurrence    Consistency            Correspondence         Representation 

1. Reading Process                Yes            Weak                   Yes                    Yes 
2. Responding to Text             Yes            No                     Yes                    Weak 
3. Information and Research       No             Insufficient Number    Insufficient Number    Insufficient Number 
4. Grammar/Usage and Mechanics    Yes            Yes                    Yes                    Yes 
5. Literature                     No             No                     Yes                    Yes 

Source: Norman L. Webb, Alignment Study in Language Arts, Mathematics, Science, and Social Studies of State Standards and Assessments for Four States, Council of Chief State School 
Officers, December 2002. 



In summary, much work has been done to enable 
matching tests to content standards, and to determine 
whether the desired match has been achieved. Some 
advocates of high-stakes testing are inclined to view 
this work as an effort by experts to erect impediments 
to action and results. But if we think about the matter 
in common sense terms, it is clear that alignment is 
vitally important. Content standards represent what 
education policy makers want students to know and 
be able to do. The test is an instrument to see whether 
or not that goal is achieved. If the test is not “done 
right,” then the question of whether the goal is reached 
remains unanswered — and the risk of undue nega- 
tive consequences for teachers and schools is greatly 
increased. 

Alignment of Instruction to Content Standards, 
and Tests to Instruction. If the accountability test is 
aligned with the content standards, and if the curricu- 
lum actually in use in the classroom is also faithful to 
the content standards, then the test can help tell if the 
standards are being mastered. In other words, when 
what is taught and what is tested are both connected 
to the content standards, the three pieces fit together. 
Where they don’t, a number of things may happen. 



• The test may be measuring achievement of the 
state content standards and students do poorly 
because that is not what they are taught; 

• Curriculum and instruction may be addressing 
the state content standards, but since the test is 
not aligned to them, it does not measure achieve- 
ment of the standards and it is hard to tell what the 
scores mean; 

• The test may be measuring what is taught but not 
the achievement of the state content standards, 
because what is taught is not aligned to the content 
standards; 

• What is taught does not align with the content stan- 
dards, and neither does the test, so no one knows 
what the test is measuring. 
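
The four scenarios follow from three pairwise relationships: test to standards, instruction to standards, and test to instruction. A minimal sketch, with a hypothetical function name and simplified yes/no inputs (alignment in practice is a matter of degree):

```python
def alignment_scenario(test_to_standards, instruction_to_standards, test_to_instruction):
    """Map three yes/no alignment judgments to the scenarios listed above."""
    if test_to_standards and instruction_to_standards:
        # When both are aligned to the standards, the test also matches what is taught.
        return "aligned: the test can help tell if the standards are being mastered"
    if test_to_standards:
        return "scenario 1: students do poorly because the standards are not what is taught"
    if instruction_to_standards:
        return "scenario 2: it is hard to tell what the scores mean"
    if test_to_instruction:
        return "scenario 3: the test measures what is taught, not the standards"
    return "scenario 4: no one knows what the test is measuring"


print(alignment_scenario(True, False, False))
print(alignment_scenario(False, False, True))
```

The third argument matters only when neither the test nor instruction is aligned to the standards; it separates the third scenario from the fourth.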

These scenarios are not equally likely to happen. 
Furthermore, it takes a brave teacher to teach the 
content standards when the test doesn’t match them, 
as in the second scenario. More often, it appears that 
teachers shift their instruction to what is going to be 
tested, knowing that their (or their school’s) effective- 
ness may be judged according to students’ perfor- 
mance. This is discussed later in the report. 